library(tidyverse)    # loads dplyr, tidyr, readr, ggplot2, and friends
library(janitor)      # cleaning and examining data
library(here)         # project-relative file paths
library(wordcloud)    # base-graphics word clouds
library(RColorBrewer) # color palettes for wordcloud
library(patchwork)    # composing multiple ggplots
library(ggwordcloud)  # geom_text_wordcloud() for ggplot-style word clouds
library(paletteer)    # extra color palettes for ggplot
# setting `code_folding: hide` under `html_document` in the YAML header lets readers expand and collapse code chunks in the knitted document; `results = 'hide'` in a chunk hides its text output while keeping the code
Hi there! This post is coming to you from Juliet Cohen and Scout Leonard! We are both students at UCSB’s Bren School of Environmental Science & Management in the inaugural cohort of the Masters of Environmental Data Science (MEDS) Program.
Scout is interested in growing as an environmental data scientist after having worked with large datasets through food system and food security work in Oakland, California. Already, she is so pleased with the emphasis her MEDS courses have placed on responsible data science, as she hopes to use her MEDS toolkit to analyze data to influence food security and resource use policy that builds sustainable, equitable food systems. You can learn more about her previous and ongoing work at her website: scoutcleonard.github.io.
Juliet is interested in applying environmental data science to wildlife biology and the interaction between humans and endangered species’ natural habitats. Juliet was inspired by her experience serving as a field technician in California and Hawaii. Throughout the MEDS summer curriculum, Juliet strengthened her collaborative programming skills and looks forward to learning about spatial analysis and modelling large, dynamic data sets. She hopes to contribute to open-source projects in the future.
For the first 6 weeks of our degree, we embarked on a whirlwind of introductory data science. From 9 AM to 5 PM we sat in the newly constructed MEDS classroom at the National Center for Ecological Analysis and Synthesis (NCEAS) in downtown Santa Barbara, learning the basics we needed to jumpstart our data science degrees. These one- to two-week classes consisted not only of lectures, but of coding labs, collaborative team science projects, and individual and group presentations. We also had “flex sessions” for non-course content, like panels from various data scientists at NCEAS and representatives from local groups of R users, such as Santa Barbara R Ladies and Eco Data Science. Our summer term laid the foundation not only for the code we’ll write this year, but also for how we create workflows and collaborate as growing scientists. These courses included:

- EDS 212: Essential Math in Environmental Data Science
- EDS 221: Scientific Programming Essentials
- EDS 214: Analytical Workflows and Scientific Reproducibility
- EDS 215: Introduction to Data Storage and Management
- EDS 216: Meta-Analysis and Systematic Reviews
After this intense MEDS summer, Juliet and Scout took some time to relax before fall quarter, but we also took this blog-post-writing opportunity to showcase some of what our class has learned so far.
As we reflected on the first quarter of our degree, we decided it may be more interesting to show, rather than tell, some of the skills we’ve learned. We also thought it might be fun to share reflections from our whole cohort to truly represent the student experience of this fast-paced, learning-filled summer.
As such, we developed a survey (in Google forms) to send to our classmates and gather data about their perspectives on these first six weeks. We wondered about two main questions: 1.) what did the cohort think of our classes? and 2.) what kinds of fun things has the group been up to in our delightful home of Santa Barbara?
We developed a Google form for our peers to give feedback on these questions. Our questions were written with data tidying and visualization in mind, and when we got some unexpected answers, we gained perspective on how better survey design could have saved us some wrangling. MEDS students, however, do not shy away from problematic data, so we persisted with the problems that arose :)
The following is a description of how we visualized the most interesting survey data submitted by our peers, including neat graphics describing our MEDS summer, from tidying data to surfing after class. We hope you find it insightful, but also fun :)
First, we read the data from our Google form survey, which we were able to download as a .csv file, into the R project we made for this blog project.
data <- read_csv(here("data", "MEDS Summer Reflection Survey (Responses) - Responses Clean.csv"))
Next, we renamed the columns. Google forms does this frustrating, but understandable thing where the names of the columns of data are the questions we asked participants. This makes for super long names that are not fun to write code with. We instead named the columns of interest with the corresponding order of our coursework, i.e. column 1 is our first course, EDS 212.
# rename the long Google-form question columns to short course identifiers
data_clean <- data %>%
  rename("1" = "Write 3 words to describe or represent week 1 (EDS212 w/ Allison) here:") %>%
  rename("2_3" = "Write 3 words to describe or represent weeks 2 & 3 (EDS221 w/ Allison) here:") %>%
  rename("4" = "Write 3 words to describe or represent week 4 (EDS214 w/ Julien) here:") %>%
  rename("5" = "Write 3 words to describe or represent week 5 (EDS215 w/ Frew) here:") %>%
  # the form question for EDS 216 mistakenly says "week 1"; the string must match the column name in the .csv
  rename("6" = "Write 3 words to describe or represent week 1 (EDS216 w/ Scott) here:")
The Google form format included five questions (one for each summer course) where students wrote in three words to describe how they felt about the course. In the .csv of the survey data, the three words submitted by a participant were grouped together in one cell per course.
To visualize how often certain descriptors for each summer course appear, we first needed to separate the three terms submitted for each course into separate observations. We executed this using the separate_rows() function. This expanded the terms into three separate observations per student in each class column.
# separate the columns into rows by parsing the 3 words in each observation into 3 different observations
# select certain cols because our first data viz is only using certain cols
data_clean_1 <- data_clean %>%
  separate_rows("1") %>%
  select("Email Address", "1")

data_clean_2_3 <- data_clean %>%
  separate_rows("2_3") %>%
  select("Email Address", "2_3")

data_clean_4 <- data_clean %>%
  separate_rows("4") %>%
  select("Email Address", "4")

data_clean_5 <- data_clean %>%
  separate_rows("5") %>%
  select("Email Address", "5")

data_clean_6 <- data_clean %>%
  separate_rows("6") %>%
  select("Email Address", "6")
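As a side note, the five nearly identical blocks above could also be written as one loop with purrr. This is just a sketch of that alternative, not the approach we used; the small tibble below is a made-up stand-in for our real survey data, with only two of the five course columns.

```r
library(dplyr)
library(tidyr)
library(purrr)

# made-up stand-in for the renamed survey data (two respondents, two courses)
data_clean <- tibble(
  `Email Address` = c("a@ucsb.edu", "b@ucsb.edu"),
  `1` = c("fun fast math", "intense helpful new"),
  `2_3` = c("loops functions tidy", "practice practice practice")
)

course_cols <- c("1", "2_3")

# one separate_rows() + select() per course column, collected in a named list
data_clean_list <- course_cols %>%
  set_names() %>%
  map(function(col) {
    data_clean %>%
      separate_rows(all_of(col)) %>%
      select(`Email Address`, all_of(col))
  })

data_clean_list[["1"]]  # one descriptive word per row
```

With the real data, `course_cols` would be all five identifiers (`"1"`, `"2_3"`, `"4"`, `"5"`, `"6"`), and each element of the list would match the corresponding `data_clean_*` data frame above.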
Then, to observe the frequency of words used to describe each course, we used the table() function. This creates a contingency table with one entry per distinct descriptive word and the frequency with which that word occurs for the class. We then converted each table to a data frame with columns Var1 (the word) and Freq (its count).
# use table() to count each descriptive word, then convert the counts to a
# data frame with columns Var1 (word) and Freq (count)
course_1_df <- as.data.frame(table(data_clean_1$"1"))

course_2_3_df <- as.data.frame(table(data_clean_2_3$"2_3"))

course_4_df <- as.data.frame(table(data_clean_4$"4"))

# drop a few stray responses for this course
course_5_df <- as.data.frame(table(data_clean_5$"5")) %>%
  filter(!Var1 %in% c("tangent", "tangents", "dry"))

course_6_df <- as.data.frame(table(data_clean_6$"6"))
After this, we wanted to visualize the frequency of words for each class. We used ggplot to create word clouds representing the frequency of class descriptors. The word clouds display the descriptive words submitted by our classmates in sizes proportional to how often the words were used. The ggplot extension we used is called ggwordcloud, which provides the geom_text_wordcloud() geom. We updated the colors by adding an aesthetic so that each descriptor is shown in a different color, and we scaled the word sizes so the clouds are easier to read.
Since we used ggplot quite a bit in EDS 221, we opted for this method for generating word clouds over another package called wordcloud, which is specifically for word clouds. We found that we wanted to showcase our visualization skills by stacking visualizations and adding titles and colors, and it was easier to do this in a ggplot version of word clouds, since we had so much practice with other plots in ggplot.
We tried the wordcloud package first. It made the correct visualizations, but we found it more difficult to make them as polished as some of the ggplots we made in class. It was a relief to learn that there is a word cloud geom for ggplot!
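For reference, here is a minimal sketch of what the wordcloud-package approach looks like. The word_freqs data frame is a made-up stand-in for illustration; the real input would be one of the course_*_df word-frequency data frames above.

```r
library(wordcloud)
library(RColorBrewer)

# made-up word-frequency data frame, standing in for a course_*_df
word_freqs <- data.frame(
  Var1 = c("math", "derivatives", "fun", "fast", "welcoming"),
  Freq = c(5, 3, 4, 2, 1)
)

# wordcloud() sizes each word by its frequency; random.order = FALSE
# places the most frequent words in the center
wordcloud(words = word_freqs$Var1,
          freq = word_freqs$Freq,
          min.freq = 1,
          random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))
```

This works, but titling, theming, and stacking the clouds with patchwork is much easier with the ggplot version below.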
And finally, the ggplot word clouds!
# for each cloud, specify the background color within the theme to match the background color of the blog
cloud_1 <- ggplot(course_1_df, aes(label = Var1, size = Freq, color = Var1)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 25) +
  theme(plot.title = element_text(size = 25),
        panel.background = element_rect(fill = "white")) +
  labs(title = "Week 1 EDS 212: Essential Math in Environmental Data Science")

cloud_1

cloud_2_3 <- ggplot(course_2_3_df, aes(label = Var1, size = Freq, color = Var1)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 25) +
  theme(plot.title = element_text(size = 25),
        panel.background = element_rect(fill = "white")) +
  labs(title = "Weeks 2 & 3 EDS 221: Scientific Programming Essentials")

cloud_2_3

cloud_4 <- ggplot(course_4_df, aes(label = Var1, size = Freq, color = Var1)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 25) +
  theme(plot.title = element_text(size = 25),
        panel.background = element_rect(fill = "white")) +
  labs(title = "Week 4 EDS 214: Analytical Workflows and Scientific Reproducibility")

cloud_4

cloud_5 <- ggplot(course_5_df, aes(label = Var1, size = Freq, color = Var1)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 25) +
  theme(plot.title = element_text(size = 25),
        panel.background = element_rect(fill = "white")) +
  labs(title = "Week 5 EDS 215: Introduction to Data Storage and Management")

cloud_5

cloud_6 <- ggplot(course_6_df, aes(label = Var1, size = Freq, color = Var1)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 25) +
  theme(plot.title = element_text(size = 25),
        panel.background = element_rect(fill = "white")) +
  labs(title = "Week 6 EDS 216: Meta-Analysis and Systematic Reviews")

cloud_6
# use patchwork to stack the graphs
cloud_1 / cloud_2_3 / cloud_4 / cloud_5 / cloud_6
# read the .csv of Santa Barbara activities
data_activities <- read_csv(here("data", "sb_activities_data.csv"))
# count votes per activity and convert the table to a data frame
activities_clean <- as.data.frame(table(data_activities$"sb_activities"))
# geom_col() draws bars at the given heights; reorder() sorts activities by vote count
SB_activities <- ggplot(activities_clean, aes(y = reorder(Var1, Freq), x = Freq)) +
  geom_col(aes(fill = Var1), color = "blue") +
  scale_fill_paletteer_d("dutchmasters::milkmaid") +
  theme(legend.position = "none",
        panel.grid = element_blank(),
        panel.background = element_rect(fill = "white")) +
  labs(title = "MEDS Favorite Santa Barbara Activities",
       y = "Activity",
       x = "Total Votes")

SB_activities
The MEDS 2022 cohort gathered at NCEAS downtown after class during summer session.
Members of the MEDS cohort in downtown Santa Barbara celebrating completing the first half of summer session classes with faculty and their pets.
“Expanding my coding fundamentals, tidy Tuesday’s, building a portfolio and updating my website, and working collaboratively on the capstone project.”
“I’m excited to build on the foundation we created this summer!”
“Excited to work with some spatial data in Frew’s next class!”
“I’m looking forward to learning more skills in data science and be able to apply them to assignments and projects in our classes. I’m also looking forward to learning more about potential future careers in environmental data science.”